Take-home_Ex03
Vast Challenge 2023 Mini Challenge 3 (Subtask: 3)
Tasks and Questions:
Use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.
Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.
Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user. Limit your response to 400 words and 5 images.
Measure similarity of businesses that you group in the previous question. Express confidence in your groupings visually. Limit your response to 400 words and 4 images.
Based on your visualizations, provide evidence for or against the case that anomalous companies are involved in illegal fishing. Which business groups should FishEye investigate further? Limit your response to 600 words and 6 images.
Reflection: What was the most difficult aspect of working with this knowledge graph? Did you have the tools and resources you needed to complete the challenge? What additional resources would have helped you? Limit your response to 300 words
1 About the dataset
1.1 Data dictionary
Node Attributes:
• type – Type of node as defined above.
• country – Country associated with the entity. This can be a full country name or a two-letter country code.
• product_services – Description of the products and services that the “id” node provides.
• revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.
• id – Identifier of the node; this is also the name of the entity.
• role – A subset of the “type” attribute; not present on every node.
• dataset – Always “MC3”.
Edge Attributes:
• type – Type of the edge as defined above.
• source – ID of the source node.
• target – ID of the target node.
• dataset – Always “MC3”.
• role – A subset of the “type” attribute; not present on every edge.
1.2 Importing the datasets
Import libraries
The new libraries used today are:
• jsonlite – to import JSON files.
• tidytext – to do basic text mining in R.
Load the MC3 dataset
Extracting edges
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
• distinct() is used to ensure that there are no duplicated records.
• mutate() and as.character() are used to convert the field data type from list to character.
• group_by() and summarise() are used to count the number of unique links.
• filter(source != target) is used to ensure that there are no records with the same source and target.
There are no duplicates in the mc3_edges dataframe.
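The extraction chunk itself is hidden; a rough, self-contained sketch of the same distinct/mutate/group_by/summarise/filter pipeline is shown below, with toy data standing in for mc3_data$links (whose exact structure is assumed):

```r
library(dplyr)

# Toy stand-in for mc3_data$links (assumed structure); the real chunk
# would start from as_tibble(mc3_data$links) instead.
links <- tibble::tibble(
  source = c("A", "A", "B", "C"),
  target = c("X", "X", "Y", "C"),
  type   = c("Beneficial Owner", "Beneficial Owner",
             "Company Contacts", "Company Contacts")
)

mc3_edges <- links %>%
  distinct() %>%                                   # drop duplicated records
  mutate(across(c(source, target, type), as.character)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n(), .groups = "drop") %>%   # count unique links
  filter(source != target)                         # drop self-loops (C -> C)
```

Because distinct() runs before the count, every remaining link is unique, which is consistent with the weights column being all 1s in the skim() report later.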
Extracting nodes
The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
Show the code
• mutate() and as.character() are used to convert the field data type from list to character.
• To convert revenue_omu from list to numeric, we first convert the values to character using as.character(); then as.numeric() is used to convert them to numeric.
• select() is used to re-organise the order of the fields.
Check for duplicates in mc3_nodes dataframe across all columns and remove them.
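A minimal sketch of that duplicate check and removal, on toy data (the actual hidden chunk operates on mc3_nodes):

```r
library(dplyr)

# Toy nodes with one fully duplicated row (assumed columns)
nodes <- tibble::tibble(
  id      = c("Andrews LLC", "Andrews LLC", "Andrews Ltd"),
  country = c("ZH", "ZH", "ZH")
)

n_dups <- sum(duplicated(nodes))   # rows duplicated across all columns
nodes  <- distinct(nodes)          # keep one copy of each row
```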
2 Initial Data Exploration
2.1 Exploring the edges data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there are no missing values in any field.
Why does the source column have a maximum character length of 700?
Let us check the longest value in the source column of the edges file.
Show the code
[1] "c(\"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \n\"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Water World Limited Liability Company Freight\", \"Water World Limited Liability Company Freight\")"
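One way to locate that longest value is sketched below in base R (toy data; the hidden chunk may differ and would use mc3_edges$source):

```r
# Toy source column; the real check would use mc3_edges$source
src <- c("Short Co",
         "c(\"Baltic Sprat Incorporated Investment\", \"Water World\")")

longest <- src[which.max(nchar(src))]  # value with the maximum character length
nchar(longest)
```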
It seems we have found another issue with the mc3_edges dataframe: some values in the source column are still in list format. We need to unlist these hidden source companies.
2.1.1 Cleaning the edges data frame
Now, we will split the mc3_edges dataframe into two dataframes:
df1: rows where source does NOT contain “c(” — these are actual source values, so no further processing is needed.
df2: rows where source starts with “c(”. The source column still contains an inner list of source entities, so it needs to be unlisted further.
Show the code
# Creating an empty data frame for the two subsets
df1 <- data.frame()
df2 <- data.frame()
# Looping through each row of the original data frame
for (i in 1:nrow(mc3_edges)) {
if (!grepl("c\\(", mc3_edges$source[i])) {
# Append the row to df1 if source does not contain "c("
df1 <- rbind(df1, mc3_edges[i, ])
} else {
# Append the row to df2 if source contains "c("
df2 <- rbind(df2, mc3_edges[i, ])
}
}

df1 dataframe
| source | target | type | weights |
|---|---|---|---|
| 1 AS Marine sanctuary | Christina Taylor | Company Contacts | 1 |
| 1 AS Marine sanctuary | Debbie Sanders | Beneficial Owner | 1 |
| 1 Ltd. Liability Co Cargo | Angela Smith | Beneficial Owner | 1 |
| 1 S.A. de C.V. | Catherine Cox | Company Contacts | 1 |
| 1 and Sagl Forwading | Angela Mendoza | Company Contacts | 1 |
df2 dataframe
| source | target | type | weights |
|---|---|---|---|
| c(“1 Ltd. Liability Co”, “1 Ltd. Liability Co”) | Yesenia Oliver | Company Contacts | 1 |
| c(“1 Swordfish Ltd Solutions”, “1 Swordfish Ltd Solutions”, “Saharan Coast BV Marine”, “Olas del Sur Estuary”) | Daniel Reese | Company Contacts | 1 |
| c(“5 Limited Liability Company”, “Bahía de Coral Kga”) | Brittany Jones | Beneficial Owner | 1 |
| c(“5 Limited Liability Company”, “Bahía de Coral Kga”) | Elizabeth Torres | Beneficial Owner | 1 |
| c(“5 Limited Liability Company”, “Bahía de Coral Kga”) | Sandra Roberts | Company Contacts | 1 |
We would need to clean up the df2 dataframe’s source column.
The code chunk below uses:
• the gsub() function to remove the unwanted characters from the source column. It replaces any occurrence of ", c( or ) with an empty string ''. The backslash is an escape character in regular expressions; here it escapes the parentheses so that they are treated as literal characters in the pattern.
• the trimws() function to remove any leading or trailing whitespace from the cleaned source column.
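On a single value, the cleaning step behaves like this (a sketch; note that gsub() would also strip a literal "c(" occurring inside a company name, which is assumed not to happen here):

```r
x <- "c(\"1 Ltd. Liability Co\", \"1 Ltd. Liability Co\")"

# Remove every occurrence of ", c( and ), then trim whitespace
cleaned <- trimws(gsub("c\\(|\\)|\"", "", x))
cleaned  # "1 Ltd. Liability Co, 1 Ltd. Liability Co"
```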
Show the code
| source | target | type | weights |
|---|---|---|---|
| 1 Ltd. Liability Co, 1 Ltd. Liability Co | Yesenia Oliver | Company Contacts | 1 |
| 1 Swordfish Ltd Solutions, 1 Swordfish Ltd Solutions, Saharan Coast BV Marine, Olas del Sur Estuary | Daniel Reese | Company Contacts | 1 |
| 5 Limited Liability Company, Bahía de Coral Kga | Brittany Jones | Beneficial Owner | 1 |
| 5 Limited Liability Company, Bahía de Coral Kga | Elizabeth Torres | Beneficial Owner | 1 |
| 5 Limited Liability Company, Bahía de Coral Kga | Sandra Roberts | Company Contacts | 1 |
For each row, extract each entity in the source column and insert it as a new row.
Show the code
# Create a new data frame for the modified rows
df2_modified <- data.frame()
# Loop through each row of df2
for (i in 1:nrow(df2)) {
# Split the source value by comma
source_values <- unlist(strsplit(df2$source[i], ", "))
# Create a new row for each source value
for (value in source_values) {
# Create a new row with the same "target", "type", and "weights" values
new_row <- data.frame(
source = value,
target = df2$target[i],
type = df2$type[i],
weights = df2$weights[i]
)
# Append the new row to df2_modified
df2_modified <- rbind(df2_modified, new_row)
}
}
# Print the modified data frame
#cat("df2_modified:\n")
kable(head(df2_modified,10))

| source | target | type | weights |
|---|---|---|---|
| 1 Ltd. Liability Co | Yesenia Oliver | Company Contacts | 1 |
| 1 Ltd. Liability Co | Yesenia Oliver | Company Contacts | 1 |
| 1 Swordfish Ltd Solutions | Daniel Reese | Company Contacts | 1 |
| 1 Swordfish Ltd Solutions | Daniel Reese | Company Contacts | 1 |
| Saharan Coast BV Marine | Daniel Reese | Company Contacts | 1 |
| Olas del Sur Estuary | Daniel Reese | Company Contacts | 1 |
| 5 Limited Liability Company | Brittany Jones | Beneficial Owner | 1 |
| Bahía de Coral Kga | Brittany Jones | Beneficial Owner | 1 |
| 5 Limited Liability Company | Elizabeth Torres | Beneficial Owner | 1 |
| Bahía de Coral Kga | Elizabeth Torres | Beneficial Owner | 1 |
Question to think about: do we aggregate the weights for each source-target pair, or treat them as duplicates?
For now, let us aggregate and store the number of occurrences in the weights column. Now we have a df2_modified dataframe in which each source-target pair is unique and has no duplicates.
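The aggregation can be sketched as follows on toy rows mirroring df2_modified (the hidden chunk presumably does something similar):

```r
library(dplyr)

df2_modified <- tibble::tibble(
  source  = c("1 Ltd. Liability Co", "1 Ltd. Liability Co", "Olas del Sur Estuary"),
  target  = c("Yesenia Oliver", "Yesenia Oliver", "Daniel Reese"),
  type    = c("Company Contacts", "Company Contacts", "Company Contacts"),
  weights = 1
)

df2_agg <- df2_modified %>%
  group_by(source, target, type) %>%
  summarise(weights = n(), .groups = "drop") %>%  # occurrences per pair
  arrange(desc(weights))
```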
Show the code
| source | target | type | weights |
|---|---|---|---|
| Niger River Delta S.p.A. | Cole Allen | Company Contacts | 20 |
| Niger River Delta S.p.A. | Shawn Myers | Company Contacts | 20 |
| Balti Sprat Inorporated Investment | Jose Ramirez | Company Contacts | 14 |
| Cape Verde Islands Pl Otter | Duane Edwards | Company Contacts | 14 |
| Cape Verde Islands Pl Otter | Jill Newman | Beneficial Owner | 14 |
| Greek Makerel Ltd. Corporation Express | Megan Wyatt | Company Contacts | 13 |
Finally, we will append df1 below df2_modified to get our edges_combined data frame, which now has 24,935 rows after unlisting all source entities.
In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.
Show the code
Let us use a bar chart to visualise the type column. (This chart will be improved later.)
The edges_combined df contains relationships between individuals and companies, and describes whether an individual is a Beneficial Owner or a Company Contact of a company.
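A minimal version of such a bar chart might look like this (toy edges; the type column is assumed from edges_combined):

```r
library(ggplot2)

# Toy stand-in for edges_combined (assumed column)
edges_combined <- data.frame(
  type = c("Beneficial Owner", "Company Contacts", "Company Contacts")
)

p <- ggplot(edges_combined, aes(y = type)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Distribution of edge types",
       x = "Count", y = "Relationship type") +
  theme_minimal()
```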
2.1.2 Build a node dataframe using new edge dataframe
Rebuild a brand new mc3_nodes1 dataframe (enhanced version of mc3_nodes dataframe), by appending unique target and source entities from the edges_combined dataframe to get a column id.
This newly created dataframe only has 1 column id. We can bring in nodes attributes like country, type, revenue_omu, product_services from the original mc3_nodes dataframe into the mc3_nodes1 dataframe.
Show the code
id1 <- edges_combined %>%
select(source) %>%
rename(id = source)
id2 <- edges_combined %>%
select(target) %>%
rename(id = target)
mc3_nodes1 <- rbind(id1, id2) %>%
distinct() %>% #<< ensures no duplicated in `id` column
left_join(mc3_nodes, by='id',
unmatched = "drop") #<<< bring in 4 node attributes from `mc3_nodes` df

After the left join, the mc3_nodes1 df increased from 34,443 to 35,767 rows, because entities like Adams LLC have both Company Contacts and Beneficial Owner types in the mc3_nodes df.
Let us take a sneak peek at the mc3_nodes1 df now. We notice the following:
• ‘Andrew Yu’ does not have a value in the type column. However, his type value can be retrieved from the edges_combined df.
• ‘Andrews PLC’ as a single entity has two types, namely ‘Company Contacts’ and ‘Beneficial Owner’.
| id | country | type | revenue_omu | product_services |
|---|---|---|---|---|
| Andrew Woodward | NA | NA | NA | NA |
| Andrew Yu | NA | NA | NA | NA |
| Andrews LLC | ZH | Company | 72609.01 | Food preparations and kindred products |
| Andrews Ltd | ZH | Company Contacts | NA | character(0) |
| Andrews PLC | ZH | Beneficial Owner | NA | character(0) |
| Andrews PLC | ZH | Company Contacts | NA | character(0) |
| Andrews and Sons | ZH | Company | NA | Unknown |
| Andrews and Sons | ZH | Company Contacts | NA | character(0) |
Looking into the edges_combined df where the type attribute of ‘Andrew Yu’ is stored.
He is a beneficial owner of ‘Mar del Paraíso GmbH’.
| source | target | type | weights |
|---|---|---|---|
| Mar del Paraíso GmbH | Andrew Yu | Beneficial Owner | 1 |
We would like to ingest his relationship type with ‘Mar del Paraíso GmbH’ into the mc3_nodes1 dataframe.
We can further enrich the type attribute of each id in the mc3_nodes1 dataframe by extracting the type values of the target entity from the edges_combined dataframe. The number of records further increased from 35,767 to 39,437 because each target entity in the edges_combined df can have more than one type. An example is ‘Aaron Garcia’, who is both a Company Contact and a Beneficial Owner to a company.
Show the code
# Perform a left join to bring in the "type" values from 'edges_combined` df to 'mc3_nodes1` df
merged_df <- left_join(mc3_nodes1, edges_combined, by = c("id" = "target"))
# Replace NA values in the "type.x" column with non-NA values from the "type.y" column
merged_df$type.x[is.na(merged_df$type.x)] <- merged_df$type.y[is.na(merged_df$type.x)]
# Select the relevant columns and rename "type.x" to "type"
merged_df <- merged_df %>%
select(id, country, type = type.x, revenue_omu, product_services)

The code chunk above replaces the NA values in the “type.x” column of the merged_df dataframe with non-NA values from the “type.y” column.
• merged_df$type.x refers to the “type.x” column of the merged_df dataframe. This is the column where we want to replace the NA values.
• is.na(merged_df$type.x) creates a logical vector with TRUE for NA values and FALSE for non-NA values in the “type.x” column.
• merged_df$type.y[is.na(merged_df$type.x)] uses that logical vector as an index to select the values from the “type.y” column corresponding to the NA positions in the “type.x” column.
Note that the type attribute of ‘Andrew Yu’ and similar others have been added to this df.
| id | country | type | revenue_omu | product_services |
|---|---|---|---|---|
| Andrew Woodward | NA | Beneficial Owner | NA | NA |
| Andrew Yu | NA | Beneficial Owner | NA | NA |
| Andrews LLC | ZH | Company | 72609.01 | Food preparations and kindred products |
| Andrews Ltd | ZH | Company Contacts | NA | character(0) |
| Andrews PLC | ZH | Beneficial Owner | NA | character(0) |
| Andrews PLC | ZH | Company Contacts | NA | character(0) |
| Andrews and Sons | ZH | Company | NA | Unknown |
| Andrews and Sons | ZH | Company Contacts | NA | character(0) |
Check for duplicates. There are once again duplicates after the left outer join of mc3_nodes1 with the edges_combined df, because edges_combined contains multiple relationship types for each individual and its associated company (e.g. Cole Allen).
We will drop the duplicates in merged_df, leaving us with 36,731 rows. merged_df is now an enhanced version of our main nodes file. It contains nodes (seafood and non-seafood related) that are present in the edges_combined df. In addition, the type node attribute was enriched using the edges_combined df. Using datatable() of the DT package, let us view merged_df as an interactive table on the html document.
Show the code
Finally, the nodes and edges dataframes are ready! We will use merged_df as our main nodes file and edges_combined as our main edges file from now on. Before further cleaning, with what we have now, let's build an initial visualisation of the network.
2.1.3 Building network model with tidygraph
First, create a graph object using tbl_graph() function. Then calculate betweenness and closeness centrality scores.
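A sketch of that step on a three-node toy graph (the real call would pass merged_df and edges_combined; the exact arguments here are assumptions):

```r
library(dplyr)
library(tidygraph)

# Toy nodes/edges; the real graph uses merged_df and edges_combined
nodes <- data.frame(id = c("A", "B", "C"))
edges <- data.frame(from = c(1, 2), to = c(2, 3))

mc3_graph <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality   = centrality_closeness())
```

In this toy graph, B sits on the only shortest path between A and C, so its betweenness score is 1.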
2.1.4 Visualising network graph
We will filter nodes with high betweenness centrality scores (>2,000,000) and visualise them to see the relationships that they have.
Show the code
set.seed(1234)
mc3_graph %>%
filter(betweenness_centrality >= 2000000) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(#width= weights,
alpha=0.5)) +
geom_node_point(aes(
size = betweenness_centrality,
#color = type,
alpha = 0.3)) +
geom_node_label(aes(label = id),repel=TRUE, size=2.5, alpha = 0.8) +
scale_size_continuous(range=c(1,10)) +
theme_graph() +
labs(title = 'Initial network visualisation \n(Seafood and non-seafood)',
subtitle = 'Entities with betweenness scores > 2,000,000')
Below is a dataframe showing us the top 10 entities with the highest betweenness scores.
Show the code
| id | country | type | revenue_omu | product_services | betweenness_centrality | closeness_centrality | membership |
|---|---|---|---|---|---|---|---|
| Wave Warriors S.A. de C.V. Express | Lumindoria | Company | 1761580.75 | Beer, ale, and malt liquors, as well as nonalcoholic beer and other related products | 3981896 | 1.31e-05 | 6 |
| Dutch Oyster Sagl Cruise ship | Marebak | Company | 62913.42 | Import and export of textiles, knitwear and raw materials | 3878339 | 1.34e-05 | 6 |
| Senegal Coast Ltd. Liability Co | Oceanus | Company | NA | Unknown | 3725057 | 1.34e-05 | 6 |
| Limpopo River Ltd. Liability Co | Marebak | Company | 12065.26 | Meat and processed meat products | 3632403 | 1.30e-05 | 6 |
| Ocean Observers Marine mist | Puerto Sol | Company | 39678.54 | Transportation and other related services | 3559968 | 1.24e-05 | 6 |
| Matthew Reynolds | NA | Company Contacts | NA | NA | 3484895 | 1.31e-05 | 6 |
| Niger Bend AS Express | Puerto Sol | Company | 613590.73 | Cruise ship holidays | 3479011 | 1.32e-05 | 6 |
| Luangwa River Limited Liability Company Holdings | Sol y Oceana | Company | NA | Chemicals and allied products, such as acids, industrial and heavy chemicals, dyestuffs, industrial salts, rosin, and turpentine | 3305332 | 1.23e-05 | 6 |
| Jennifer Smith | NA | Beneficial Owner | NA | NA | 3142993 | 1.25e-05 | 6 |
| Coral del Mar SE United | Oceanus | Company | NA | Unknown | 3133167 | 1.34e-05 | 6 |
The top 10 betweenness entities above are not dealing with seafood related industries directly. In the next section, we will filter entities from the merged_df dataframe for only seafood related entities. We may revisit the non-seafood entities later when we have specific targets/companies to investigate.
HIVE plot (to be explored later)
3 Exploring the merged_df nodes data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of merged_df tibble data frame. We will not be exploring mc3_nodes dataframe as merged_df dataframe is an enhanced version of it.
| Name | merged_df |
| Number of rows | 36731 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 6 | 64 | 0 | 34442 | 0 |
| country | 30045 | 0.18 | 2 | 14 | 0 | 85 | 0 |
| type | 7816 | 0.79 | 7 | 16 | 0 | 3 | 0 |
| product_services | 30045 | 0.18 | 4 | 1737 | 0 | 2052 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 33158 | 0.1 | 877262.3 | 11772097 | 3652.23 | 8375.22 | 17466.67 | 50229.65 | 310612303 | ▇▁▁▁▁ |
The report above reveals that there are many missing values in the country, type and product_services columns. In addition, there are several ‘Unknown’ values in the product_services column, which are as good as missing. This section focuses on cleaning the product_services column by first performing text sensing and then labelling each company as either ‘seafood’, ‘others’ or ‘unknown’.
The report also reveals that about 90% of the values in the revenue_omu column are NA.
Check the distribution of the type column in merged_df dataframe.
Show the code
merged_df$type1 <- ifelse(is.na(merged_df$type), "Missing", merged_df$type)
merged_df$type1 <- as.factor(merged_df$type1)
ggplot(data = merged_df,
aes(x = reorder(type1,type1,
function(x)+length(x)))) +
geom_bar(fill='lightblue') +
ylim(0,20000) +
geom_text(stat="count",
aes(label=paste0(..count.., ", ",
round(..count../sum(..count..)*100, 1), "%")),
hjust=-0.1,
size = 5) +
coord_flip() +
theme_minimal() +
labs(x = 'Relationship type between entities',
title = 'Distribution of business relationships') +
theme(plot.title = element_text(size = 22,
face='bold',
hjust = 0.5),
axis.title.x=element_text(size= 20,
hjust = 0.5),
axis.title.y=element_text(size= 20),
axis.text.x = element_text(size = 15,
color = "black",
angle = 0,
hjust = 0.5,
vjust = 0.5),
axis.text.y = element_text(size = 15,
color = "black",
angle = 0,
hjust = 0.5,
vjust = 0.5),
panel.grid.major.y = element_blank() )
3.1 Text Sensing with tidytext
In section 2.1.2, we saw that the product_services column of the merged_df dataframe contains raw text describing the products that each entity provides. We would like to give each company a meaningful label based on its product_services.
Hence in this section, we will perform basic text sensing using appropriate functions of tidytext package.
3.1.1 Simple word count
The code chunk below calculates number of times the word fish appeared in the field product_services.
Show the code
# A tibble: 6 × 7
id country type revenue_omu product_services type1 n_fish
<chr> <chr> <chr> <dbl> <chr> <fct> <int>
1 Gvardeysk Sextant ОАО… Uziland Comp… 73027. Fish salads (It… Comp… 11
2 Taylor LLC ZH Comp… 138982. Fish (anchovy, … Comp… 11
3 SeaSelect Foods Salt … Marebak Comp… 41902. European whole … Comp… 7
4 Samaka Chart ОАО Deli… Nalako… Comp… 16207. Live crayfish, … Comp… 6
5 suō yú Ltd. Liability… Coralm… Comp… 31567. Offers a wide r… Comp… 6
6 Arunachal Pradesh s S… Marebak Comp… 60346. Offers a wide r… Comp… 6
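Such a count can be reproduced with str_count(), sketched here on toy rows (the hidden chunk's exact wording may differ):

```r
library(dplyr)
library(stringr)

df <- tibble::tibble(
  id = c("Taylor LLC", "Ola Azul Ges.m.b.H. Services"),
  product_services = c("Fish (anchovy, mackerel) and fish products",
                       "Seafood product preparation and packaging")
)

fish_counts <- df %>%
  # lowercase first so "Fish" and "fish" are both counted
  mutate(n_fish = str_count(str_to_lower(product_services), "fish")) %>%
  arrange(desc(n_fish))
```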
3.1.2 Tokenisation
The word tokenisation has different meanings in different scientific domains. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters such as punctuation marks may be discarded. The tokens usually become the input for processes like parsing and text mining.
In the code chunk below, unnest_tokens() of tidytext is used to split the text in the product_services column into words.
• The two basic arguments to unnest_tokens() used here are column names: first the output column name that will be created as the text is unnested into it (word, in this case), then the input column that the text comes from (product_services, in this case).
• Before tokenising, the NA values under product_services have been replaced by the string ‘unknown’.
• By default, punctuation is stripped.
• By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behaviour.)
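The steps above can be sketched as follows (toy rows; the NA-to-‘unknown’ replacement mirrors the description):

```r
library(dplyr)
library(tidytext)

df <- tibble::tibble(
  id = c("Taylor LLC", "Andrews Ltd"),
  product_services = c("Fish and SEAFOOD products.", NA)
)

token_nodes <- df %>%
  mutate(product_services = ifelse(is.na(product_services),
                                   "unknown", product_services)) %>%
  unnest_tokens(word, product_services)  # lowercases and strips punctuation
```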
3.1.3 Removing stopwords
We will use the stop_words dataset that comes with the tidytext package to clean up stop words. In addition, we can append our own custom stopwords to the stop_words list.
Show the code
# Create a new dataframe with customised stopwords
new_stopwords <- data.frame(word = c("unknown", "character", "0", "products","range", "offers","including"))
# Add the new stopwords to the existing stop_words dataframe
stop_words <- bind_rows(stop_words, new_stopwords)
# remove stopwords from `token_nodes`
stopwords_removed <- token_nodes %>%
anti_join(stop_words)

There are two processes:
• Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis.
• Then anti_join() of the dplyr package is used to remove all stop words from the analysis.
Now we can visualise the words extracted by using the code chunk below.
Show the code
stopwords_removed %>%
count(word, sort = TRUE) %>%
top_n(50) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Count",
y = "Unique words",
title = "Count of unique words found in product_services field") +
theme(axis.text.y = element_text(size = 7.5, hjust = 0))
The diagram above shows that words like ‘fish’, ‘seafood’ and ‘salmon’ are related to the seafood industry, so companies whose product_services contain these words can be tagged with the label ‘seafood’. Non-seafood-related companies can be tagged ‘others’, while companies with missing product services can be tagged ‘unknown’.
3.2 Preparing a master node dataframe with unique id per row
3.2.1 Adding a label column using product_services column
Creating the nodes_all_notagg dataframe from merged_df, with a new column called label that groups all the seafood-related companies together and the non-seafood-related companies together. There are three categories in the label column, namely ‘seafood’, ‘others’ and ‘unknown’.
Note that in the nodes_all_notagg dataframe there can be duplicated ids, because an entity can be associated with more than one country, one relationship type and one product label group.
(Refer to ‘Manipur Market Ltd. Liability Co’ if needed)
Show the code
#library(stringr)
# Define the seafood keywords
seafood_keywords <- c("sea food", "seafood", "fish", "prawn", "shrimp", "shell", "crab", "lobster", "mussel", "cavier", "oyster", "octopus", "squid", "aquatic", "crayfish", "tuna", "salmon", "scallop", "mackerel", "trout", "sardine", "winkle", "Barramundi", "tilapia")
# Add the 'label' column based on the 'product_services' column
nodes_all_notagg <- merged_df %>%
mutate(product_services = ifelse(is.na(product_services), 'Unknown', product_services)) %>%
mutate(label = ifelse(product_services == 'Unknown' | product_services == 'character(0)', 'unknown',
ifelse(str_detect(product_services, regex(paste(seafood_keywords, collapse = "|"), ignore_case = TRUE)), 'seafood', 'others')))

In the code chunk above, we first check whether product_services is equal to either “Unknown” or “character(0)”; if true, label is set to ‘unknown’.
Next, if product_services contains seafood_keywords, then ‘label’ is set to ‘seafood’.
If none of the previous conditions are met, ‘label’ is set to ‘others’.
The nodes_all_notagg df can be used together with the edges_combined edge file.
Visualising the newly created label column using bar charts. The results are as expected because most of the Beneficial Owners and Company Contacts do not have values under the product_services column.
Show the code
ggplot(data = nodes_all_notagg,
aes(x = reorder(label,label,
function(x) + length(x)))) +
geom_bar(fill='lightblue') +
ylim(0,50000) +
coord_flip() +
geom_text(stat="count",
aes(label=paste0(..count.., ", ",
round(..count../sum(..count..)*100, 1), "%")),
hjust= -0.1,
size = 5) +
theme_minimal() +
labs(x = 'Product services labels',
title = 'Distribution of labels') +
theme(plot.title = element_text(size = 22,
face='bold',
hjust = 0.5),
axis.title.x=element_text(size= 20,
hjust = 0.5),
axis.title.y=element_text(size= 20),
axis.text.x = element_text(size = 15,
color = "black",
angle = 0,
hjust = 0.5,
vjust = 0.5),
axis.text.y = element_text(size = 15,
color = "black",
angle = 0,
hjust = 0.5,
vjust = 0.5),
panel.grid.major.y = element_blank() )
3.2.2 Aggregation to ensure no duplicates of id in master node df
As explained earlier, the entity names in the id column of the nodes_all_notagg dataframe are not unique.
Thus, from the nodes_all_notagg dataframe, we will create an aggregated version called nodes_all_agg (our master node df), where there is only one instance of each id. Aggregation is needed in order not to lose any information.
First, group by id and create new columns :
country_qty : number of countries that an entity is associated with
type_qty : number of business relationship types an entity has
revenue_sum: total revenue of entity
label_qty: number of categories of products associated with entity
country_concat : list of all countries associated with entity
type_concat : list of all business relationship types associated with entity
label_concat : list of all product labels associated with entity
Show the code
nodes_all_agg <- nodes_all_notagg %>%
group_by(id) %>%
summarise(country_qty = ifelse(all(is.na(country)), 0, n_distinct(country)),
type_qty = ifelse(all(is.na(type)), 0, n_distinct(type)),
label_qty = n_distinct(label),
revenue_sum = sum(revenue_omu, na.rm=TRUE),
country_concat = ifelse(country_qty > 1, paste(unique(country), collapse = ', '), country[1]),
type_concat = ifelse(type_qty > 1, paste(unique(type), collapse = ', '), type[1]),
label_concat = ifelse(label_qty > 1, paste(unique(label), collapse = ', '), label[1]),
product_services_concat = paste(product_services, collapse = "| ")
)

The code above, for each entity, checks whether:
• the country values are all null. If so, the country_qty column is 0. Otherwise, the number of distinct countries is computed using n_distinct().
• the number of distinct countries (country_qty) is greater than 1. If it is, paste() with unique(country) is used to concatenate only the unique country values, separated by a comma and a space. Otherwise, the first country value in the group (country[1]) is simply used. This ensures that the country values are concatenated only when they differ within a group.
Take a peek at a few rows in nodes_all_agg. It contains information on seafood and non-seafood related entities and has 34,442 rows in total.
| id | country_qty | type_qty | label_qty | revenue_sum | country_concat | type_concat | label_concat | product_services_concat |
|---|---|---|---|---|---|---|---|---|
| Oceanfront Oasis GmbH & Co. KG Carriers | 2 | 1 | 2 | 138105.75 | Brindivaria, Oceanus | Company | others, unknown | Lotions and other skin care products, cosmetics, perfumes and toilet preparations, and other personal hygiene products| Unknown |
| Océano del Este ОАО | 2 | 1 | 2 | 136636.13 | Solovarossa, Oceanus | Company | seafood, unknown | Offers a wide range of products such as salmon and salmon eggs, parr and smolt, and various other species, including cobia and cod juveniles| Unknown |
| Odisha Ltd. Liability Co | 2 | 1 | 2 | 93427.09 | Oceanus, Marebak | Company | unknown, seafood | Unknown| A range of fish and other related seafood products| Unknown |
| Oka Ltd. Corporation Transport | 2 | 1 | 1 | 40152.33 | Korvelonia, Oceanus | Company | seafood | Fish and fish products| Pink and chum salmon |
| Ola Azul Ges.m.b.H. Services | 2 | 1 | 2 | 59994.96 | Nalakond, Zawalinda | Company | seafood, unknown | Seafood product preparation and packaging| Unknown |
| Ola del Mar SRL | 2 | 1 | 2 | 33526.70 | Zawalinda, Puerto del Mar | Company | others, unknown | Primarily involved in providing air, surface, or combined courier delivery services of parcels generally between metropolitan areas or urban centers| Unknown |
3.2.3 seafood_entities df
In this section, we will subset the nodes_all_agg dataframe above by filtering label == ‘seafood’ to extract only seafood-related entities information.
Filter for ids whose label_concat column contains ‘seafood’.
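That filter can be sketched as follows (toy rows; str_detect() runs on the concatenated labels):

```r
library(dplyr)
library(stringr)

# Toy stand-in for nodes_all_agg (assumed columns)
nodes_all_agg <- tibble::tibble(
  id           = c("Odisha Ltd. Liability Co", "Ola del Mar SRL"),
  label_concat = c("unknown, seafood", "others, unknown")
)

seafood_entities <- nodes_all_agg %>%
  filter(str_detect(label_concat, "seafood"))
```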
Take a peek at a few rows in seafood_entities. There are 651 seafood-related entities.
| id | country_qty | type_qty | label_qty | revenue_sum | country_concat | type_concat | label_concat | product_services_concat |
|---|---|---|---|---|---|---|---|---|
| Bahía de Plata Trout | 1 | 1 | 1 | 16349.68 | Rio Isla | Company | seafood | Grocery products (Canned and frozen foods, milk, fresh fruits and vegetables, fresh and prepared meats, fish, poultry and poultry products); Other household products (alcohol, household cleaning products, medicine, and clothes) |
| Bahía del Sol Deep-sea | 1 | 1 | 1 | 36095.44 | Nalakond | Company | seafood | Fish and seafood products (tuna, salmon, herring, shellfish, and groundfish products; and flounder fillets, cornmeal pollock strips, burger, tuna steak, frozen halibut steaks, as well as canned sockeye salmon and frozen sockeye, and crabs) |
| Bahía del Sol Kga Consulting | 1 | 1 | 1 | 14016.58 | Vesperanda | Company | seafood | Seafood products |
| Baker LLC | 1 | 3 | 2 | 11016.82 | ZH | Company, Beneficial Owner, Company Contacts | seafood, unknown | Canned fish and seafood products \| character(0) \| character(0) |
| Baker and Sons | 1 | 2 | 2 | 104095830.23 | ZH | Company, Beneficial Owner | seafood, unknown | Fish; fresh or chilled, mackerel (Scomber scombrus, Scomber australasicus, Scomber japonicus), excluding fillets, fish meat of 0304, and edible fish offal of subheadings 0302.91 to 0302.99 \| character(0) |
| Balkan GmbH & Co. KG Cargo | 1 | 1 | 1 | 29457.13 | Nalakond | Company | seafood | Seafood and related products |
3.2.4 seafood_edges df
Next, we need to filter only the relevant relationships from the edges_combined df according to the seafood_entities df.
Filter the rows of the edges_combined df where either the source or target value is found in the id column of the seafood_entities dataframe. This retains all relationships involving seafood-related entities. There are no duplicates in the seafood_edges df.
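A minimal sketch of this edge filter, using a hypothetical miniature of edges_combined (all names and values are made up):

```r
library(dplyr)

# Hypothetical miniature of edges_combined, with one duplicated row
edges_combined_demo <- tibble::tibble(
  source = c("Alpha Co", "Beta Co", "Alpha Co", "Alpha Co"),
  target = c("Jane Doe", "John Roe", "Jane Doe", "Jan Poe")
)
seafood_ids <- c("Alpha Co", "Gamma Co")  # ids from seafood_entities

# Keep edges touching a seafood-related entity; distinct() drops duplicates
seafood_edges_demo <- edges_combined_demo %>%
  filter(source %in% seafood_ids | target %in% seafood_ids) %>%
  distinct()
```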
3.2.5 seafood_nodes df
Using seafood_edges df, create a node file called seafood_nodes for network visualisation later.
A left join with nodes_all_agg is required to ingest all the nodes attributes.
Show the code
# Extract unique ids from 'source' and 'target' columns of 'seafood_edges' dataframe
unique_entities <- unique(c(seafood_edges$source, seafood_edges$target))
# Create a new 'seafood_nodes' dataframe as a tibble
seafood_nodes <- as_tibble(data.frame(id = unique_entities))
# Perform left outer join with my nodes master table 'nodes_all_agg'
seafood_nodes <- left_join(seafood_nodes, nodes_all_agg, by = "id")
4 The analysis (Seafood)
Seafood_graph tbl object
The seafood_nodes and seafood_edges dfs will be the main files used to create a tbl graph object. The tbl_graph function from tidygraph library will be used.
The seafood_graph object has 3,114 nodes and 2,505 edges, and contains 609 subgraphs (connected components).
In addition, we will calculate various centrality scores and also add a column called membership for us to recognise which subgraph each id belongs to.
Show the code
seafood_graph<- tbl_graph(nodes=seafood_nodes,
edges = seafood_edges,
directed = FALSE)
seafood_graph <- seafood_graph %>%
activate(nodes) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness(),
eigenvector_centrality = centrality_eigen(),
membership = components(seafood_graph)$membership)
The interactive datatable of seafood_graph nodes below shows all the additional centrality and membership attributes calculated using wrapper functions of the tidygraph package. You can sort the various columns.
We will build a network graph that contains only seafood_related business relationships. The graph will show only entities with high betweenness scores.
Prepare edge file for visNetwork library. Rename source to from and target to to.
Prepare node file for visNetwork library. By renaming type_concat to group, visnetwork could help us to colour the nodes by business relationship types.
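The two renames can be sketched as follows (the demo tibbles below are invented stand-ins for seafood_edges and seafood_nodes):

```r
library(dplyr)

seafood_edges_demo <- tibble::tibble(source = "Alpha Co", target = "Jane Doe")
seafood_nodes_demo <- tibble::tibble(id = c("Alpha Co", "Jane Doe"),
                                     type_concat = c("Company", "Beneficial Owner"))

# visNetwork expects 'from'/'to' edge columns and colours nodes by 'group'
seafood_edges_vis_demo <- seafood_edges_demo %>% rename(from = source, to = target)
seafood_nodes_vis_demo <- seafood_nodes_demo %>% rename(group = type_concat)
```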
Let us visualise the nodes with betweenness score 15 and above.
Show the code
set.seed(1234)
visNetwork(seafood_nodes_vis %>% filter(betweenness_centrality >= 15),
seafood_edges_vis,
main = "Interactive Network graph of top betweenness entities",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_nicely") %>%
visEdges(smooth = list(enabled = TRUE,
type = 'straightCross'),
shadow = FALSE,
dashes = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
All the entities shown above have relatively higher betweenness scores than other entities in the entire network. Hovering the mouse over each node and edge reveals attributes such as the betweenness score, the country associated with each node, and its revenue.
At a glance, there are two subgraphs (bottom) with 5 high-betweenness nodes connected to one another, forming larger-than-normal networks. We are interested in subgraphs with a longer network diameter like these because they represent more complex business relationships. For instance, such subgraphs contain 'Company', 'Beneficial Owner' and 'Company Contacts' entities, with more than one company in the network. From the literature review, having individuals associated with two or more companies (where conflicts of interest could occur) could suggest transshipment activity. This allows us to investigate the entities in these complex subgraphs for possible IUU crime.
Instead of having to 'eye-ball' each of the 609 seafood network subgraphs to identify those with a longer network diameter, we can extract the edge data of each subgraph (using the membership column) and compile the network diameter of each subgraph into a dataframe.
Calculate network diameter
Show the code
# Get unique membership values
unique_memberships <- unique(seafood_nodes_vis$membership)
# Initialize empty list to store results
results <- list()
# Iterate over each membership value
for (x in unique_memberships) {
# Filter nodes based on membership
nodes <- seafood_nodes_vis %>%
filter(membership == x)
# Filter edges based on nodes
edges <- seafood_edges_vis %>%
filter(from %in% nodes$id | to %in% nodes$id)
# Create subgraph
subgraph <- as_tbl_graph(edges, directed = FALSE)
# Calculate network diameter
diameter <- with_graph(subgraph, graph_diameter())
# Store results in list
results[[as.character(x)]] <- diameter
}
# Create DataFrame with membership and network diameter columns
diameter_df <- tibble(
membership = unique_memberships,
network_diameter = unlist(results)
)
Let us visualise the distribution of network_diameter across all the seafood subgraphs.
First, let us fix the order of the network diameter field by descending counts of network diameter.
Show the code
d <- highlight_key(diameter_df %>% arrange(desc(network_diameter)))
p <- ggplot(data = diameter_df,
aes(x = as.factor(network_diameter))) +
geom_bar(fill='lightblue') +
coord_flip() +
theme_minimal() +
labs(title = 'Distribution of Network Diameter',
x= 'Network diameter') +
theme(plot.title = element_text(face='bold'))
gg <- highlight(ggplotly(p),
"plotly_selected")
crosstalk::bscols(gg,
DT::datatable(d,options = list(iDisplayLength = 5)),
widths=5)
There are about 32 subgraphs with a network diameter of 4 and above. We can investigate these subgraphs more closely.
4.1 Subgraphs with high NETWORK RISK
First, let us ingest the network diameter of each subgraph into the main seafood_nodes_vis dataframe by performing a left join with diameter_df using membership columns in both df as join key.
Show the code
seafood_nodes_vis<- seafood_nodes_vis %>%
left_join(diameter_df, by='membership',
unmatched = "drop") %>%
arrange(desc(network_diameter), desc(betweenness_centrality)) %>%
select(id,membership,network_diameter,betweenness_centrality,closeness_centrality,eigenvector_centrality,revenue_sum,country_qty,type_qty,label_qty,country_concat,type_concat,label_concat,product_services_concat,group,title)
Is there a difference in the betweenness scores across network diameters?
Show the code
ggbetweenstats(data = seafood_nodes_vis, x = network_diameter, y = betweenness_centrality,
xlab = "network_diameter", ylab = "betweenness_centrality",
type = "np", pairwise.comparisons = TRUE, pairwise.display = "s",
sort = "descending",
sort.fun = median,
mean.ci = T, p.adjust.method = "fdr", conf.level = 0.95,
title = "Comparison of Betweenness centrality across different network diameters") +
scale_y_continuous(limits = c(0, 2500)) +
theme(axis.title.y=element_text(angle = 0,
vjust=0.9))
From the plot above, the p-value < 0.05, so we have evidence to conclude that betweenness scores differ across networks of different diameters. There are many entities with very high betweenness scores in subgraphs with network_diameter of 2, 4 and 6. However, in this section we will focus only on subgraphs with network_diameter of 4 and above because of the more complex business relationships within them.
Next filter the subgraphs where network diameter is 4 and above.
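The filter can be sketched on a toy diameter table (the membership and diameter values below are invented):

```r
library(dplyr)

diameter_demo <- tibble::tibble(
  membership       = c(112, 227, 5, 85),
  network_diameter = c(6, 6, 2, 4)
)

# Keep only subgraphs with diameter 4 and above, largest first
high_diameter_demo <- diameter_demo %>%
  filter(network_diameter >= 4) %>%
  arrange(desc(network_diameter))
```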
There are only 2 subgraphs with network diameter of 6. They are subgraphs 112 and 227.
Show the code
Visualisation of the two subgraphs with highest network diameter.
Filter for all the nodes and edges in subgraph 112 from seafood_nodes_vis dataframe.
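A sketch of the node/edge subsetting for one subgraph, on toy data (the real code filters seafood_nodes_vis on membership == 112 and keeps every edge of seafood_edges_vis that touches one of those nodes):

```r
library(dplyr)

nodes_demo <- tibble::tibble(id = c("A", "B", "C"),
                             membership = c(112, 112, 5))
edges_demo <- tibble::tibble(from = c("A", "C"),
                             to   = c("B", "B"))

# Nodes of subgraph 112, then every edge touching one of those nodes
sub_nodes_demo <- nodes_demo %>% filter(membership == 112)
sub_edges_demo <- edges_demo %>%
  filter(from %in% sub_nodes_demo$id | to %in% sub_nodes_demo$id)
```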
Visualise the network graph 112.
Show the code
set.seed(1234)
visNetwork(sub112_nodes_vis, #%>% filter(betweenness_centrality >= 20),
sub112_edges_vis,
main = "Network graph of subgraph 112 with diameter =6",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(smooth = list(enabled = TRUE,
type = 'straightCross'),
shadow = FALSE,
dashes = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
In this subgraph, we can see three companies (blue) linked together by two individuals, namely Andrew Reed and John Hernandez.
Andrew Reed is a beneficial owner to both ‘Adair S.A. de C.V.’ and ‘Oka Ltd. Corporation Transport’.
John Hernandez is a Company contacts of ‘Danish Plaice Swordfish AB Shipping’ and beneficial owner of ‘Adair S.A. de C.V.’
The datatable containing details of the members in subgraph 112 allows us to filter them by betweenness scores and other attributes.
Show the code
Filter for all the nodes and edges in subgraph 227 from seafood_nodes_vis dataframe
Visualising the subgraph 227
Show the code
set.seed(1234)
visNetwork(sub227_nodes_vis, #%>% filter(betweenness_centrality >= 20),
sub227_edges_vis,
main = "Network graph of subgraph 227 with diameter =6",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(smooth = list(enabled = TRUE,
type = 'straightCross'),
shadow = FALSE,
dashes = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
In this subgraph, we can see three companies linked by two individuals, namely Christopher Rodrigues and Lisa Brown.
Lisa Brown is beneficial owner to both ‘Tide NV solutions’ and ‘Deep Blue Cargo ship’.
Christopher Rodriguez is a beneficial owner of 'Deep Blue Cargo ship' and a company contact of 'Lewis PLC'. This suggests that 'Deep Blue Cargo ship' could have business dealings with 'Lewis PLC' via Christopher.
The datatable containing details of the members in subgraph 227 allows us to filter them by betweenness scores and other attributes.
4.2 High FINANCIAL RISK
In this section, we will look for subgraphs with exceptionally high total revenue. Since bigger subgraphs are expected to have higher revenue, the total revenue is divided by the number of companies in each subgraph for a fair comparison.
The code chunk below first groups by the membership and network_diameter columns to calculate the sum of revenue, called total_revenue_subgraph. It then counts the number of times 'Company' appears in the type_concat column (excluding 'Company Contacts') to give the number of distinct companies in the subgraph. Finally, the total revenue of each subgraph is divided by the number of companies to give the revenue_per_company field.
Show the code
financial_risk_nodes <- seafood_nodes_vis %>%
group_by(membership, network_diameter) %>%
summarize(
total_revenue_subgraph = as.numeric(sum(revenue_sum, na.rm=TRUE)),
no_of_companies_subgraph = sum(str_count(type_concat, "\\bCompany\\b")) - sum(str_count(type_concat, "\\bCompany Contacts\\b"))) %>%
ungroup() %>%
mutate(
revenue_per_company = ifelse(no_of_companies_subgraph == 0, total_revenue_subgraph, round(total_revenue_subgraph / no_of_companies_subgraph))) %>%
arrange(desc(revenue_per_company))
The summary statistics show an extreme outlier, with a maximum revenue_per_company of $291,436,839. We might have to use a log scale for the box plots.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 12822 32167 1265696 72367 291436839
Visualising the distribution of revenue per company across different network diameters.
Show the code
library(scales)
ggplotly(ggplot(data = financial_risk_nodes %>%
mutate(network_diameter=as.factor(network_diameter)),
aes(x = reorder(network_diameter,
-revenue_per_company,
median),
y = revenue_per_company,
fill=network_diameter)) +
geom_boxplot(outlier.colour="blue",
outlier.size=1) +
geom_point(aes(label=membership),
position = 'jitter',
size=0.5) +
stat_summary(fun = mean,
geom="point",
shape=20,
size=2.5,
color="pink",
fill="red") +
xlab("Network diameter size") +
ylab("Revenue per company") +
ggtitle("Revenue per company by network diameter size") +
#scale_fill_brewer(palette='Set2') +
theme(plot.title = element_text(face= 'bold',
hjust = 0.5),
legend.position = 'none') +
scale_y_continuous(trans = log10_trans()))
Hovering the mouse over the points reveals their subgraph membership numbers. Surprisingly, groups with smaller network diameters tend to have a higher revenue per company.
The top 10 subgraphs with the highest total revenue per company are listed below. The network diameters are mainly 1 or 2.
| revenue_per_company | membership | network_diameter | no_of_companies_subgraph |
|---|---|---|---|
| 291436839 | 396 | 1 | 1 |
| 131450837 | 577 | 2 | 0 |
| 95809780 | 316 | 2 | 1 |
| 63153107 | 173 | 2 | 1 |
| 55376193 | 193 | 1 | 0 |
| 52053645 | 162 | 4 | 2 |
| 32183079 | 581 | 2 | 1 |
| 1507514 | 498 | 2 | 1 |
| 1216029 | 553 | 2 | 1 |
| 1205868 | 48 | 1 | 1 |
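A top-n table like the one above can be produced with dplyr's slice_max; a minimal sketch on invented values:

```r
library(dplyr)

fr_demo <- tibble::tibble(
  membership          = 1:5,
  revenue_per_company = c(100, 5000, 300, 4000, 50)
)

# Highest revenue_per_company first
top_demo <- fr_demo %>% slice_max(revenue_per_company, n = 3)
```

In the actual analysis, `financial_risk_nodes %>% slice_max(revenue_per_company, n = 10)` would yield the ten rows shown.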
Visualising the top 2 subgraphs in terms of revenue per company.
Subgraph 396
Show the code
m <- 396
sub396_nodes_vis <- seafood_nodes_vis %>% filter(membership == m)
sub396_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub396_nodes_vis$id | seafood_edges_vis$to %in% sub396_nodes_vis$id, ]
set.seed(1234)
visNetwork(sub396_nodes_vis,
sub396_edges_vis,
main = "Network graph of subgraph 396 with diameter = 1",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_nicely") %>%
visEdges(smooth = list(enabled = TRUE,
type = 'straightCross'),
shadow = FALSE,
dashes = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
In this subgraph, Morgan Group commands a very high revenue of 291 million with only one business relationship: Jason Cole, its beneficial owner. Morgan Group sells fish products (smoked or not smoked).
Information about the two members:
Subgraph 577
Show the code
m <- 577
sub577_nodes_vis <- seafood_nodes_vis %>% filter(membership == m)
sub577_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub577_nodes_vis$id | seafood_edges_vis$to %in% sub577_nodes_vis$id, ]
set.seed(1234)
visNetwork(sub577_nodes_vis,
sub577_edges_vis,
main = "Network graph of subgraph 577 with diameter = 2",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(smooth = list(enabled = TRUE,
type = 'straightCross'),
shadow = FALSE,
dashes = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
In this subgraph, the total revenue of 131 million is recorded under the name 'Wilson LLC', which is linked to three individuals. The product service offered by this company is 'Fish, dried but not smoked'. 'Wilson LLC' therefore seems more like a company than a company contact or beneficial owner.
Information about the members:
4.3 High COUNTRY RISK
In this section, we will focus on subgraphs whose business relationships involve many countries. This could indicate transshipment across carriers from different countries.
According to the literature review, one method of IUU is the use of "flags of convenience", where vessels fly different countries' flags in different locations to avoid inspection and interception by local governing bodies.
The code chunk below continues from Section 4.2. It creates a new column, no_of_countries_subgraph, which computes the number of distinct country values present in each subgraph.
Show the code
country_risk_nodes <- seafood_nodes_vis %>%
group_by(membership, network_diameter) %>%
summarize(
total_revenue_subgraph = as.numeric(sum(revenue_sum, na.rm=TRUE)),
no_of_companies_subgraph = sum(str_count(type_concat, "\\bCompany\\b")) - sum(str_count(type_concat, "\\bCompany Contacts\\b")),
no_of_countries_subgraph = n_distinct(na.omit(unlist(strsplit(country_concat, ", "))))) %>%
ungroup() %>%
mutate(
revenue_per_company = ifelse(no_of_companies_subgraph == 0, total_revenue_subgraph, round(total_revenue_subgraph / no_of_companies_subgraph))) %>%
arrange(desc(revenue_per_company))
We will use a bar chart to visualise the count of countries in each business relationship.
Show the code
d <- highlight_key(country_risk_nodes %>% arrange(desc(no_of_countries_subgraph)))
p<-ggplot(data=country_risk_nodes,
aes(x=as.factor(no_of_countries_subgraph))) +
geom_bar(fill = 'lightblue') +
coord_flip() +
theme_minimal() +
labs(title= 'Distribution of \nnumber of countries \nin subgraphs',
x= 'Number of countries involved in a subgraph')
gg <- highlight(ggplotly(p),
"plotly_selected")
crosstalk::bscols(gg,
DT::datatable(d,options = list(iDisplayLength = 5)),
widths=5)
Subgraph 6 is associated with the highest number of countries (10).
Subgraph 85 is associated with 4 different countries.
There are 15 other subgraphs that are associated with 3 different countries.
Subgraph 6 is very special because its relationships span 10 countries, so let us zoom in on its network graph.
Show the code
m <- 6
sub6_nodes_vis <- seafood_nodes_vis %>% filter(membership == m)
sub6_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub6_nodes_vis$id | seafood_edges_vis$to %in% sub6_nodes_vis$id, ]
set.seed(1234)
visNetwork(sub6_nodes_vis,
sub6_edges_vis,
main = "Network graph of subgraph 6 with diameter = 4",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(smooth = list(enabled = TRUE,
type = 'straightCross'),
shadow = FALSE,
dashes = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
In the subgraph above, there are two companies: 'Aqua Aura SE Marine life', registered in nine countries (Mawazam, Rio Isla, Icarnia, Oceanus, Nalakond, Coralmarica, Alverossia, Isliandor, Talandria), and 'BlueWater Bistro GmbH Industrial', registered in Marebak.
Information about its members:
4.4 Finding similar subgraphs (groups)
In this section, we will use a parallel coordinates plot to visualise and analyse the multivariate, numerical data we have for each subgraph. We will compare multiple variables:
• no_of_companies_subgraph: the number of distinct companies in the subgraph
• no_of_beneficial_owners_subgraph: the number of beneficial owners in the subgraph
• top_betweenness: the highest betweenness score of an entity in the subgraph
• revenue_per_company: the total revenue of the subgraph divided by the total number of companies in the subgraph
Preparing the dataframe for the plot by creating two more new columns to calculate the number of beneficial owners and top betweenness score in each subgraph.
Show the code
prll <- seafood_nodes_vis %>%
group_by(membership, network_diameter) %>%
summarize(
total_revenue_subgraph = as.numeric(sum(revenue_sum, na.rm=TRUE)),
no_of_companies_subgraph = sum(str_count(type_concat, "\\bCompany\\b")) - sum(str_count(type_concat, "\\bCompany Contacts\\b")),
no_of_countries_subgraph = n_distinct(na.omit(unlist(strsplit(country_concat, ", ")))),
no_of_beneficial_owners_subgraph = sum(str_count(type_concat, "Beneficial Owner|Beneficial Owner, Company Contacts")),
top_betweenness = max(betweenness_centrality)) %>%
ungroup() %>%
mutate(
revenue_per_company = ifelse(no_of_companies_subgraph == 0, total_revenue_subgraph, round(total_revenue_subgraph / no_of_companies_subgraph)),
membership = as.factor(membership)) %>%
arrange(desc(revenue_per_company))
We will filter subgraphs with network diameter 3 and above. The plot will also be faceted by the number of countries involved in each subgraph. (Refer to the subtitle for the meaning of each box.)
Show the code
set.seed(1234)
ggparcoord(prll %>%
filter(as.numeric(network_diameter) >=3) %>%
mutate(no_of_countries_subgraph=as.factor(no_of_countries_subgraph),
network_diameter=as.factor(network_diameter)),
columns = c(4,6:8),
groupColumn = 2,
scale = "uniminmax",
alphaLines = 0.3,
boxplot = TRUE,
title = "Parallel Coordinates Plot of subgraph variables") +
theme(plot.title = element_text(face='bold'),
legend.position = "right",
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
axis.text.x = element_text(angle = 35, hjust = 1),
axis.title.y = element_blank()) +
labs(subtitle='Box 1: 1 country, Box 2: 2 countries, Box 3: 3 countries, Box 4: 10 countries') +
facet_wrap(~ `no_of_countries_subgraph`)
Box 1 shows that there is one subgraph with network diameter = 3 involving only 1 country.
In box 2, subgraphs with a higher network diameter tend to have more beneficial owners in the relationship.
In box 3, the subgraphs contain more companies. The number of beneficial owners, top betweenness score and revenue per company are also slightly higher than in box 2.
Box 4 shows one subgraph with network diameter = 4 spanning 10 countries. Its number of beneficial owners and top betweenness score are the highest among all the boxes.
5 Conclusion
From the anomalies raised in the earlier sections, it would be worth investigating subgraphs with bigger network diameters, higher revenue per company, and a presence in multiple countries.
6 REFERENCES
Kam, T. S. (2022, December 4). R for Visual Analytics. Retrieved from https://r4va.netlify.app/
Truelove, J. (2021, September 19). Hive plots with the ggraph and hiver packages. GitHub Pages. https://jtr13.github.io/cc21fall2/hive-plots-with-the-ggraph-and-hiver-packages.html
Nowak, B. J. (2021, September 30). R Network Analysis with Tidygraph. Netlify. https://bjnnowak.netlify.app/2021/09/30/r-network-analysis-with-tidygraph/
Ross, Z. (2019, April 2). Easy multi-panel plots in R using facet_wrap and facet_grid from ggplot2. ZevRoss Spatial Analysis Blog. http://zevross.com/blog/2019/04/02/easy-multi-panel-plots-in-r-using-facet_wrap-and-facet_grid-from-ggplot2/
